Successful Information Integration requires People Integration

Authors

  • M. Klumpp
  • M. Roth
Abstract

Information integration technology has been a fertile ground for research over the past two decades and has led to a number of commercially available products. However, the call for papers for this workshop rightly questions why the use of this technology is not as widely adopted as one might expect in this information age. In this paper, we take the position that the answer to this question lies not in examining what gaps might exist in the underlying integration technology, but rather in examining the gaps that exist between the people who are tasked with the information integration activity itself.

1. INTRODUCTION

The classic data integration application focuses on extracting data from heterogeneous source systems, combining and transforming it in some way, and either delivering it in real time or moving it into a consolidated target system for further analysis. Information integration technology that supports these applications has been a fertile ground for research over the past two decades and has led to a number of commercially available products. We believe that the years of research investment in federation, data warehousing, schema mapping, content management, etc., have paid off and that the underlying technology has reached a 'good enough' stage.

However, all of this technology is focused on bringing more sophisticated and powerful tools to the IT staff. In our experience, the IT staff does not make up the whole team that is responsible for producing integrated information. They are part of a larger team, many of whom are called to action long before the IT staff becomes involved. This larger team includes, for example, end users, who originate the need for the information, and business analysts, who translate business requirements into actionable specifications for the IT staff to act upon.

Successful information integration projects require a shared understanding of the business problem so that the larger team can quickly and correctly produce an application that integrates information from multiple sources. However, in practice, we have seen that most of the development time is spent not on designing or programming the application, but on managing and translating differences in languages, skills and tools between the involved subject matter experts. We believe that successful information integration requires successful people integration.

In this paper, we describe a system called Dot2Dot that extends existing schema mapping technology [6] to provide an environment that automates the flow of information between the user roles participating in an information integration project. Each role is allowed to approach the problem with the tool and information format of their choice. When the time comes to flow information from one role to the next, the information is automatically translated into the preferred representation of the next involved user role, reducing opportunities for misunderstandings and false starts.

2. MOTIVATION

The Dot2Dot system addresses three important roles involved in building data integration applications: the business analyst, who creates a specification that describes how to compute the integrated target information from the various source systems, and the data modeler and developer, who implement the data models and application code to produce the integrated data.

Figure 1: Collaboration with Dot2Dot.

Figure 1 provides a simplified overview of how these different roles interact.
Business users with a need for integrated information will typically use email and verbal communication to convey their requirements at a high level to business analysts. The analysts translate these requirements into source-to-target mapping specifications using spreadsheets. Data modelers build or modify data models that represent source and target data, and developers, sometimes working with software architects, design and write code that implements the logic described in the specifications to populate the data sources described by the data models.

With the Dot2Dot system, the interaction between roles is enabled by a common collaboration model from which work items for the various roles can be automatically derived. Such work items include spreadsheet-like source-to-target mapping specifications, draft data models, draft SQL, draft XSLT or draft ETL jobs. In the next section, we briefly describe the collaboration model, and in subsequent sections we show how it automates the information flow and supports the perspectives of each of the three user roles.

3. THE COLLABORATION MODEL

Much research has gone into schema mapping technology over the past several years; a good overview and references are provided in [4] [6]. As described in [6], schema mapping techniques almost always employ a semantic model to capture the mapping semantics of the sources and targets. In the Dot2Dot system, we have extended the use of this model beyond its original purpose as a mapping model to serve as a collaboration model.

Dot2Dot uses an extended version of the Clio [8] mapping model as the shared collaboration model between the different roles. This model captures the semantics of the integration application as a specification in a neutral format that can be surfaced in different tools by translating it into the different perspectives of each of the users. The collaboration model and its associated metadata are stored in a central repository so that they can be accessed from the various tools used by the collaborators.

Figure 2 illustrates the model. A Mapping is an element that can have 0..n sources and 0..m targets. Both the sources and targets are represented as Resources that point to metadata objects in the central repository, such as business categories and terms, and physical schemas, tables, and columns. A set of source-to-target mappings is grouped under a Specification.

Figure 2: Collaboration model.

A mapping can contain one or more Expressions that describe how the source resources associated with the mapping should be transformed into the target resources of the mapping. For example, an expression might contain a join condition between two or more source tables, or it might express a data type conversion from one type to another. In addition to expressions, BusinessRules provide additional information about the mapping, such as a description of the mapping in business terms.
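As an illustration only, and not the actual Dot2Dot implementation, the elements of the collaboration model described above can be sketched in Python roughly as follows (all class and field names are our own):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Resource:
        # Pointer to a metadata object in the central repository, e.g. a
        # physical column ('All_Cust.Nm') or a business term
        # ('Customers.GivenName').
        kind: str   # 'column', 'table', 'term', 'category', ...
        name: str

    @dataclass
    class Expression:
        # Transformation logic, e.g. a join condition or a type conversion.
        text: str

    @dataclass
    class BusinessRule:
        # Business-level description of the mapping.
        text: str

    @dataclass
    class Mapping:
        # 0..n sources mapped to 0..m targets, optionally carrying
        # expressions and business rules.
        sources: List[Resource] = field(default_factory=list)
        targets: List[Resource] = field(default_factory=list)
        expressions: List[Expression] = field(default_factory=list)
        rules: List[BusinessRule] = field(default_factory=list)

    @dataclass
    class Specification:
        # A named group of source-to-target mappings.
        name: str
        mappings: List[Mapping] = field(default_factory=list)

A Specification instance in this form plays the role of the neutral artifact from which the role-specific views described in the following sections can be derived.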
Figure 3 illustrates an instance of a collaboration model representing a specification for a simple scenario. Sources and targets of mappings are connected to the metadata model stored in a repository. The source model refers to two physical tables, 'All_Cust' and 'Acc', and the target model refers to two groups of business terms: 'Customers' and 'Residential Customer'. Because the collaboration model is an extension of a schema mapping model, it is straightforward to apply schema mapping technology and code generators to translate it into runtime artifacts such as SQL queries, XSLT or ETL jobs [5] [6].

Figure 3: Collaboration model instance.

In addition to generating runtime artifacts, Dot2Dot uses this model to facilitate collaboration by generating data models for the sources and targets of the mapping, and by supporting a higher-level business view of the mappings.

4. THE BUSINESS ANALYST'S PERSPECTIVE

The collaboration to build the integration application typically starts with a business analyst. The analyst feels most comfortable using web clients, spreadsheets and word processing software. A common deliverable for a business analyst is a spreadsheet that captures business rules describing how to compute the integrated target information from the available source systems. It is not unheard of for a midsize enterprise to have thousands of such spreadsheets sprinkled throughout file systems and email messages.

While data models for the source systems usually exist, the target data model for the new application may not yet exist, or the application may require modifications to an existing target data model. Thus, a common paradigm for the business analyst is to create mappings that refer to physical data columns on the source, and to business terms for the target that represent the business meaning of the information to compute. Examples of such terms might be 'High Value Customer Identifier' or 'Customer Name'. The business analyst often groups terms into categories that represent business objects, such as 'Customers' or 'Residential Customer'.

Figure 4: Business analyst's specification.

Figure 4 shows an example of a specification for an application that separates residential customers from a larger customer base. The first mapping contains a business rule but no expression, and describes that the physical source column 'All_Cust.Nm' should be transformed into three target pieces of information: Customers.GivenName, Customers.MiddleName and Customers.SurName. The second row defines a mapping from the 'Acc.CrdRskRtg' source column into the 'ResidentialCustomers.HighValueIdentifier' column and contains an expression as well as a business rule indicating that high value customers should be identified based on their credit risk rating.

Note that the target model is represented as a collection of categories and terms. The spreadsheet provides a convenient mechanism for the analyst to collect the subset of business terms he requires in one place, and enables him to complete work ahead of and independently from the physical implementation of the target model. In the next section, we will see how this collection of business terms and categories can be generated into a draft data model for the target that can be provided to the data modeler.
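Continuing the illustrative sketch from Section 3 (again not the actual Dot2Dot representation; the concrete expression text and rule wording are invented for the example), the two rows of the specification in Figure 4 could be captured roughly as:

    # Row 1: only a business rule, no executable expression yet.
    name_mapping = Mapping(
        sources=[Resource("column", "All_Cust.Nm")],
        targets=[Resource("term", "Customers.GivenName"),
                 Resource("term", "Customers.MiddleName"),
                 Resource("term", "Customers.SurName")],
        rules=[BusinessRule("Split the full customer name into its given, "
                            "middle and surname parts.")],
    )

    # Row 2: a business rule plus an expression; the threshold in the CASE
    # expression is an invented example value.
    high_value_mapping = Mapping(
        sources=[Resource("column", "Acc.CrdRskRtg")],
        targets=[Resource("term", "ResidentialCustomers.HighValueIdentifier")],
        expressions=[Expression(
            "CASE WHEN Acc.CrdRskRtg <= 2 THEN 'Y' ELSE 'N' END")],
        rules=[BusinessRule("Identify high value customers based on their "
                            "credit risk rating.")],
    )

    spec = Specification("Residential customers",
                         mappings=[name_mapping, high_value_mapping])

The source side refers to physical columns, while the target side refers to business terms grouped under categories, matching the collaboration model instance of Figure 3.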
5. THE DATA MODELER'S PERSPECTIVE

In our experience, a common source of false starts in the collaboration process revolves around defining a target data model that reflects a common understanding between the business analyst and the data modeler. One approach is to drive it from the perspective of the business analyst, who might communicate his requirements verbally or via email and then leave it to the data modeler to intuit how to translate those terms into a target data model. The other approach is to drive it from the perspective of the data modeler, who might define a draft data model and share visual E-R diagrams [1] [3] for the business analyst to interpret and determine whether the model can support the information he wishes to compute. The content of the diagrams is often difficult for the business analyst to absorb, and customers report that the back-and-forth iterations between terms and data model are a rich source of conflict and misunderstanding in the collaboration process.

Dot2Dot enables both the business analyst and the data modeler to maintain their respective perspectives and iteratively work towards a data model. The business analyst's selection of business terms and categories in his spreadsheet is captured in the Dot2Dot collaboration model. Observe that categories roughly correspond to entities, and terms correspond to attributes. Once the business analyst completes his work, Dot2Dot analyzes the structure of the terms and categories in his spreadsheet and translates them into either a logical or a physical data model. This process is especially effective if the business analyst stores business terms in the Dot2Dot metadata repository along with naming standards that govern proper abbreviations, data types, etc. Dot2Dot can apply these standards as it generates the data model. Once the data model is generated, it can be used to populate the collaboration model with the appropriate target columns for the analyst's spreadsheet. However, these entries will be placeholders for the final data model that will be provided by the data modeler.

The draft data model can be shared with the data modeler in the most convenient form, such as a DDL script for a physical model, or through a metabroker that exchanges information with various data modeling tools [2] [7]. The second alternative is especially powerful if the modeling tool supports glossary models as well as logical and physical data models, enabling the data modeler to see the full relationships between the term, logical and physical models. For example, Figure 5 shows the models and relationships generated by Dot2Dot for the example given in Figure 3. Notice that the logical and physical models are annotated with data type hints, and that the physical model column names are abbreviations. This is a result of Dot2Dot applying the naming standards associated with the business terms as part of model generation.

Once the data modeler receives the model(s), he can further refine and extend the business analyst's draft model using standard modeling practices. The important consideration is that the modeler isn't starting from scratch, but rather from a model that reflects exactly the information the business analyst requires for the new application. Furthermore, because naming standards were applied as the model was generated, the model may already reflect many best practices, thus reducing the workload for the data modeler.

Figure 5: Data models (and links) created by Dot2Dot.
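As a rough sketch of this generation step (the abbreviation dictionary, default data type and generation logic below are hypothetical stand-ins for the naming standards stored in the metadata repository, not Dot2Dot's actual rules), categories and terms might be turned into a draft physical table definition like this:

    from typing import Dict, List

    # Hypothetical naming standards: preferred abbreviations and a default
    # data type, of the kind that would live in the metadata repository.
    ABBREVIATIONS: Dict[str, str] = {
        "customers": "CUST", "residential": "RES",
        "given": "GVN", "middle": "MDL", "name": "NM",
        "identifier": "ID", "value": "VAL",
    }
    DEFAULT_TYPE = "VARCHAR(64)"

    def physical_name(term: str) -> str:
        # Abbreviate each word of a term, e.g. 'Given Name' -> 'GVN_NM'.
        words = term.replace(".", " ").split()
        return "_".join(ABBREVIATIONS.get(w.lower(), w.upper()) for w in words)

    def draft_ddl(category: str, terms: List[str]) -> str:
        # Generate a draft CREATE TABLE statement for one business category.
        table = physical_name(category)
        cols = ",\n  ".join(f"{physical_name(t)} {DEFAULT_TYPE}" for t in terms)
        return f"CREATE TABLE {table} (\n  {cols}\n);"

    print(draft_ddl("Customers", ["Given Name", "Middle Name", "Sur Name"]))
    # CREATE TABLE CUST (
    #   GVN_NM VARCHAR(64),
    #   MDL_NM VARCHAR(64),
    #   SUR_NM VARCHAR(64)
    # );

Because the standards live in the repository, a draft generated along these lines already reflects the organization's conventions before the data modeler ever sees it.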
The final product for the data modeler will be a physical model that represents constraints, specific data types and additional details such as primary-foreign key relationships and indices. In addition to being deployed to vendor-specific systems, this model can be re-imported into Dot2Dot and used to update the collaboration model. Figure 6 shows the updated collaboration model for the example of Figure 3, and Figure 7 shows the updated perspective for the business analyst. If links have been maintained between the term-based, logical and physical models in the modeling tool, then the update can be done automatically. However, it is possible that the links have not been maintained in the data modeling tool, or that the names of columns and tables have been changed. In this case, updating the collaboration model may require reconciliation between the original placeholder model and the model provided by the data modeler.

Figure 6: Collaboration model instance with the target model.

Dot2Dot includes discovery algorithms that leverage synonyms, abbreviations, and lexical analysis to either automatically re-establish the links between the imported physical model and the columns and terms in the collaboration model, or to provide alternatives for the business analyst to visually inspect and select from. At this point, Dot2Dot has facilitated collaboration between the business analyst and the data modeler, and the collaboration model reflects the result of their collaboration.

Figure 7: Business analyst's spreadsheet with the targets.

Each was allowed to use the tool of their choice, and Dot2Dot automated the information flow between them so that the collaboration was incremental rather than a series of fresh starts.

6. THE DEVELOPER'S PERSPECTIVE

Next up in the collaboration process is the developer. The set of mappings captured in a specification defines the logic for a developer to implement in order to transform the source data into the desired target data. The transfer of the specification from the business analyst to the developer is yet another common point at which misunderstanding is introduced into the collaboration process. The specification is a spreadsheet, written at a higher level than what is technically required at the implementation level, making it difficult for the developer to interpret. On the other hand, a section of code or pseudo-code that a developer might share with the business analyst is rich with technical detail but lacks the business context, which makes it difficult for the business analyst to understand.

To flow information from the business analyst to the developer, Dot2Dot uses well-studied schema mapping techniques to exploit the semantic information captured in the collaboration model [5]. The code generation can use schema, mapping, expression and rule definition information to create artifacts such as SQL queries, XSLT or ETL jobs. The level of detail contained in the collaboration model determines the degree of completeness of the generated code. If a mapping contains expressions, then Dot2Dot is able to completely generate the corresponding logic. If a mapping only captures business rules, the generated code will not be complete. It will, however, reflect the intentions of the business analyst in context for the developer; business rules will be connected to the appropriate sources and targets, making it much easier for the developer to understand the business analyst's intentions, and reducing the amount of code he will have to write from scratch.
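As a rough illustration of this contrast between expression-backed and rule-only mappings (the generator below is a simplified stand-in for Dot2Dot's schema-mapping-based code generation and reuses the illustrative classes and mappings sketched earlier; the CUST and RES_CUST target tables come from the running example), draft SQL might be produced as follows:

    def generate_sql(mapping: Mapping, target_table: str) -> str:
        # Draft SQL for a single mapping: expression-backed mappings become
        # complete statements, rule-only mappings become annotated skeletons.
        source_table = mapping.sources[0].name.split(".")[0]
        target_cols = ", ".join(t.name.split(".")[-1] for t in mapping.targets)
        if mapping.expressions:
            # Enough detail is available to generate the logic completely.
            select_list = ", ".join(e.text for e in mapping.expressions)
            return (f"INSERT INTO {target_table} ({target_cols})\n"
                    f"SELECT {select_list}\nFROM {source_table};")
        # Rule-only mapping: surface the analyst's intent next to the right
        # sources and targets and leave the logic for the developer.
        rule = mapping.rules[0].text if mapping.rules else "TODO"
        return (f"-- Business rule: {rule}\n"
                f"INSERT INTO {target_table} ({target_cols})\n"
                f"SELECT /* TODO: implement the rule above */ ...\n"
                f"FROM {source_table};")

    print(generate_sql(high_value_mapping, "RES_CUST"))  # complete statement
    print(generate_sql(name_mapping, "CUST"))            # skeleton with rule

The actual system emits SQL queries, XSLT or ETL jobs through its schema mapping machinery [5]; the point of the sketch is only that an expression yields complete logic, while a rule-only mapping yields a skeleton that carries the analyst's intent to the developer.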
An ETL job generated by Dot2Dot for the collaboration model instance shown in Figure 6 is shown in Figure 8.

Figure 8: An ETL job created by Dot2Dot for the developer.

In this example, Dot2Dot generated two source and two target data connections based on references to source and target tables of the collaboration model instance. The join operation was inferred from the metadata for the sources, which included information about the join condition. The analyst can override the default interpretation of how to merge the tables if, for example, more than one join condition exists or if the application requires a different kind of join. The output of the join is flowed into two target connections that represent the physical schema information for the CUST and RES_CUST tables. The conditional expression of the second mapping resulted in a transformation operation for the 'Acc' stream, which is also annotated with the business rule (not shown).

The first mapping in the specification connected 'All_Cust.Nm' on the source to the customer's first, middle and surname on the target by means of a business rule. Since it is a business rule, Dot2Dot generated an interface called TransformationInterface and connected it to the 'All_Cust' source table. As shown in Figure 9, the developer can open the interface and see the business rule and its inputs and outputs as defined by the business analyst, all within the context of the ETL flow rather than a spreadsheet. The developer can use the rule as a guide to implement the logic and replace the 'Placeholder' operator with a corresponding ETL operator that implements the given business rule.

Figure 9: Development interface.

At this point, the business analyst, data modeler and developer have collaborated to produce the first version of the application and data model. Each was allowed to use their tool of choice, and relied on Dot2Dot to flow information between them.

7. EMPIRICAL RESULTS

Dot2Dot has been implemented and successfully used in several customer environments, and users have reported productivity gains from 40% to 70%. Much of the gain can be attributed to decreased communication time between team members, standardization of artifacts, and automation of documentation and tedious tasks. In addition, users often cite another important factor: the centralized management that Dot2Dot provides. Rather than having spreadsheets, code and data models spread out in various emails and file systems, Dot2Dot provides a single environment to capture business logic and link it to the corresponding IT artifacts.

8. SUMMARY AND FUTURE WORK

Information integration is a collaborative effort, and requires users with different skill sets using different tools to reach a shared understanding of the business problem. In this paper, we have described how the Dot2Dot system starts to address this problem by extending standard schema mapping techniques to maintain different perspectives of an information integration application and to automate the information flow between the users responsible for building the application. By extending schema mapping technology to provide a common collaboration model between participating user perspectives, Dot2Dot enables a more independent and less error-prone work environment, achieves a high degree of integration and reusability between the perspectives, and leads to reduced effort and time-to-completion for information integration projects.

In future work, we will explore how to bring in other roles and information, such as the higher-level requirements of business users, as well as how to handle a richer set of data types, such as XML and unstructured data.
Another common scenario we will explore occurs when an existing application must be understood and modified, but no specification exists for it. This is a roadblock for the collaborators to reach a common understanding about what the application does, or to make modifications.

9. REFERENCES

[1] F. Abbattista, F. Lanubile, and G. Visaggio. "Recovering Conceptual Data Models is Human-Intensive". In Proc. of the 5th Intl. Conf. on Software Engineering and Knowledge Engineering, San Francisco, California, USA, pages 534-543, 1993.
[2] CA Erwin Data Modeler, http://ca.com/us/products/product.aspx?id=260
[3] C. Batini et al. "Conceptual Database Design: An Entity-Relationship Approach". Benjamin/Cummings, 1992. ISBN 0-8053-0244-1.
[4] P. A. Bernstein and H. Ho. "Model Management and Schema Mappings: Theory and Practice". VLDB 2007, pages 1439-1440.
[5] S. Dessloch et al. "Orchid: Integrating Schema Mapping and ETL".
